Entropy, KL distance, and Deviance

Jesse Brunner

From whence our metric of information?

  • entropy as a measure of information
  • KL as a measure of distance
    • added uncertainty by using an approximation for the True distribution
  • Deviance as a metric of (relative) distance
    • do not need to know what is True
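These quantities are easiest to see with a toy discrete distribution before we get to densities (a sketch of my own, not part of the example below): take a fair coin as Truth and a biased coin as the approximating model.

```r
# entropy, cross entropy, and KL divergence for two coins
p <- c(0.5, 0.5)    # True distribution (fair coin)
q <- c(0.7, 0.3)    # approximating distribution (biased coin)

(H_p  <- -sum(p * log(p)))           # entropy of Truth, ~0.693
(H_pq <- -sum(p * log(q)))           # cross entropy from using q in place of p
(Dkl  <- sum(p * (log(p) - log(q)))) # KL divergence
H_pq - H_p                           # equals Dkl: KL = cross entropy - entropy
```

Notice the cross entropy is always at least the entropy; the excess is the KL divergence, which is exactly the pattern the regression example below walks through.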

A very simple example

# True model
a <- 2
b <- 1.5
sigma <- 2

x <- c(1,5,7,10)
mu <- a+b*x

# observations (drawn randomly, so your numbers will
# differ unless you set a seed first)
y <- round(rnorm(length(mu), 
                 mean=mu, 
                 sd=sigma), 
           1)

Entropy of data | True model

\[ H(p) = -\mathbb{E}\left[ \log(p_i)\right] = -\sum_{i=1}^n p_i \log(p_i) \]

# calculate entropy of data | True model
(ps <- dnorm(y, 
             mean=mu, 
             sd=sigma)
 )
[1] 0.1972397 0.0219918 0.1209854 0.1841351
-sum(ps*log(ps))
[1] 0.9712345

Let’s fit two simple models

m0

# fit model with just a mean
m0 <- quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu ~ dnorm(5, 3),
    sigma ~ dexp(1)
  ), data=data.frame(x,y)
)
precis(m0)
           mean        sd     5.5%     94.5%
mu    10.060778 1.7455243 7.271093 12.850463
sigma  3.761734 0.9382529 2.262224  5.261243

m1

# fit model with a mean and slope
m1 <- quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b*x,
    a ~ dnorm(5, 3),
    b ~ dnorm(0,1),
    sigma ~ dexp(1)
  ), data=data.frame(x,y)
)
precis(m1)
          mean        sd      5.5%    94.5%
a     4.681411 1.4028117 2.4394467 6.923375
b     1.285321 0.2185176 0.9360877 1.634554
sigma 1.576968 0.4491855 0.8590828 2.294853

Cross entropy from using m0 to approximate Truth

\[ H(p, q) = -\sum_{i=1}^n p_i \log(q_i) \]

## cross entropy
# preds_m0: posterior-mean prediction for each observation under m0
preds_m0 <- rep(mean(extract.samples(m0)$mu), length(y))
(qs <- dnorm(y, 
             mean=preds_m0,  # probs if we use m0
             sd=mean(extract.samples(m0)$sigma)) )
[1] 0.02717686 0.06561993 0.05212724 0.02751893
-sum(ps*log(qs))
[1] 1.790003
# added entropy by using m0 to approximate True
-sum(ps*log(qs)) - -sum(ps*log(ps))
[1] 0.8187685

Cross entropy from using m1 to approximate Truth

\[ H(p, q) = -\sum_{i=1}^n p_i \log(q_i) \]

## cross entropy
# preds_m1: posterior-mean prediction for each observation under m1
post1 <- extract.samples(m1)
preds_m1 <- mean(post1$a) + mean(post1$b)*x
(rs <- dnorm(y, 
             mean=preds_m1,  # probs if we use m1
             sd=mean(post1$sigma)) )
[1] 0.09797833 0.06334920 0.21921167 0.18199660
-sum(ps*log(rs))
[1] 1.016212
# added entropy by using m1 to approximate True
-sum(ps*log(rs)) - -sum(ps*log(ps))
[1] 0.04497729

Kullback-Leibler divergence

\[ D_{KL}(p,q) = \sum_{i=1}^n p_i\left[ \log(p_i) - \log(q_i) \right] \] measures the added entropy from using a model, \(q\), to approximate the True distribution, \(p\)

# added entropy by using m0 to approximate True
-sum(ps*log(qs)) - -sum(ps*log(ps))
[1] 0.8187685
## Dkl(p,q)
sum(ps*(log(ps)-log(qs)))
[1] 0.8187685
## --> it's the same!
## added entropy by using m1 to approximate true
-sum(ps*log(rs)) - -sum(ps*log(ps))
[1] 0.04497729
## Dkl(p,r)
sum(ps*(log(ps)-log(rs)))
[1] 0.04497729
## --> it's the same!

Compare KL distances of models

\[ \begin{align} D_{KL}(p,q) - D_{KL}(p,r) & = \sum_{i=1}^n p_i\left[ \log(p_i) - \log(q_i) \right] - \sum_{i=1}^n p_i\left[ \log(p_i) - \log(r_i) \right] \\ & = \sum_{i=1}^n p_i\left[ - \log(q_i) \right] - \sum_{i=1}^n p_i\left[ - \log(r_i) \right] \end{align} \]

# Difference in KL distances between m0 and m1
sum(ps*(log(ps)-log(qs))) - sum(ps*(log(ps)-log(rs)))
[1] 0.7737912
# We can get the same result if 
# we ignore the first log(ps) term in both quantities
-sum(ps*log(qs)) - -sum(ps*log(rs))
[1] 0.7737912

What if we do not know the Truth?

We almost have all of the \(p_i\) (ps) out of the quantity, but not quite

If we take it out completely, we end up with log-probability scores, which are just unstandardized versions of the same comparison

\[ \begin{align} D_{KL}(p,q) - D_{KL}(p,r) & = \sum_{i=1}^n p_i\left[ - \log(q_i) \right] - \sum_{i=1}^n p_i\left[ - \log(r_i) \right] \\ & \propto \sum_{i=1}^n -\log(q_i) - \sum_{i=1}^n -\log(r_i) \end{align} \] So we use log probabilities to describe fit and compare them between models <phew!>

But not quite: 1) lppd

We have been pretending we have a single value for our expectations (the MAP)

In actuality, we have a full distribution (the posterior)

Enter the log pointwise predictive density (lppd)

\[ \text{lppd}(y,\Theta)=\sum_i^n \log \frac{1}{S}\sum_s^S p(y_i|\Theta_s), \]

# m0
sum(log(qs))
[1] -12.87621
sum(lppd(m0))
[1] -12.58803
# m1
sum(log(rs))
[1] -8.303587
sum(lppd(m1))
[1] NaN
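To demystify what `lppd()` is doing: for each observation it averages the likelihood over the \(S\) posterior samples, then takes the log. A hand-rolled sketch for m0 (assuming m0 has been fit as above; `rethinking`'s own implementation also uses numerically stabler log-sum-exp arithmetic):

```r
# hand-rolled lppd for m0: average each observation's likelihood
# over the posterior samples, then take the log
post <- extract.samples(m0)
lppd_m0 <- sapply(seq_along(y), function(i) {
  log(mean(dnorm(y[i], mean=post$mu, sd=post$sigma)))
})
sum(lppd_m0)   # should be close to sum(lppd(m0))
```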

But not quite: 2) deviance

And we actually use deviance = \(-2\times \text{lppd}(y,\Theta)\)

(smaller is better)

# m0
-2*sum(lppd(m0))
[1] 25.23876
# m1
-2*sum(lppd(m1))
[1] 16.74528

These are our metrics of fit! Notice that m1, with its much smaller deviance, fits our data far better than m0